ABSTRACT

Privacy  is  an  important  issue  in  data  mining  and 
knowledge  discovery.  In  this  paper,  we  propose  to  use 
the  randomized  response  techniques  to  conduct  the 
data  mining  computation.  Specifically,  we  present  a 
method  to  build  naive  Bayesian  classifiers  from  the 
disguised  data.  We  conduct  experiments  to  compare 
the  accuracy  of  our  classifier  with  the  one  built  from 
the  original  undisguised  data.  Our  results  show  that 
although  the  data  are  disguised,  our  method  can  still 
achieve  fairly  high  accuracy. 
1  Introduction

Data mining has emerged as a means for identifying patterns and trends from large amounts of data [7]. To conduct data mining computations, we first need to collect data. Without privacy concerns, data can be collected directly. However, because of privacy concerns, some people might decide to divulge information selectively, give false information, or simply refuse to disclose any information at all. A survey was conducted in 1999 [3] to understand Internet users' attitudes towards privacy. The results show that 17% of respondents are privacy fundamentalists, who are extremely concerned about any use of their data and generally unwilling to provide their data, even when privacy protection measures are in place. However, 56% of respondents form a pragmatic majority, who are also concerned about data use but less so than the fundamentalists; their concerns are often significantly reduced by the presence of privacy protection measures. The remaining 27% are marginally concerned and are generally willing to provide data under almost any condition, although they often express a mild general concern about privacy. According to this survey, providing privacy protection measures is a key to the success of data collection. How can we improve the chance of collecting more truthful data that are useful for data mining while preserving users' privacy? How can users contribute their personal information without compromising their privacy?

One way to achieve privacy is to use anonymous techniques [1], which allow users to disclose their personal information without disclosing their identities. The biggest problem with anonymous techniques is that there is no guarantee on the quality of the data set. A malicious user (e.g., a competing company) could send a great deal of random information to the database, or a company could send a lot of made-up information with the goal of making its products appear the most favorable. Either attack could render the database useless. If the communication is truly anonymous, it is difficult for the database owner to control the quality of the data. To guarantee the quality, it is important for the database owner to verify the identities of the data contributors.

Another way to achieve privacy is to let each user disguise or randomize their data, such that the data collector cannot derive the truth about a user's private information. The challenge is how to conduct data mining on the disguised data. To address this challenge, we first propose the following computing model, which consists of a data collection step and a computation step. In the data collection step, each user utilizes certain techniques to disguise his/her data, then sends the disguised data to the central warehouse; the central warehouse should not be able to find out any user's actual data with probability better than a pre-defined threshold. In the computation step, the central warehouse constructs a database from the disguised data and conducts data mining computations on this database. The goal of the central warehouse is to derive useful information (or knowledge) from this disguised database. In this paper, we focus on one specific data mining computation, naive Bayesian (NB) classification [8]. The basic idea of NB classification is to construct an NB network, which is a very simple Bayesian network with the assumption that every
variable  (feature)  of  the  data  is  independent  given  the 
class  label,  to  conduct  the  classification. 

We propose to use randomized response techniques to solve the privacy-preserving data mining problem. The basic idea of randomized response is to scramble the data in such a way that the central warehouse cannot tell, with probability better than a pre-defined threshold, whether the data from a customer contain truthful or false information. Although the information from each individual user is scrambled, if the number of users is sufficiently large, the aggregate information of these users can be estimated with decent accuracy. This property is useful for naive Bayesian classification, since it is based on aggregate values of a data set rather than on individual data items.

The contributions of this paper are as follows: (1) We have modified the naive Bayesian classification algorithm [8] to make it work with data modified by randomized response techniques, and implemented the modified algorithm. (2) We then conducted a series of experiments to measure the accuracy of our modified naive Bayesian algorithm on randomized data. Our results show that, if we choose appropriate randomization parameters, the accuracy we achieve is very close to the accuracy achieved by the original naive Bayesian classification on the original data.

The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3, we describe how to utilize the multivariate randomized response technique to build a naive Bayesian classifier on randomized data. In Section 4, we describe our experimental results. We give our conclusion in Section 5.

2  Related  Work 

Agrawal and Srikant proposed a scheme for privacy-preserving data mining using random perturbation [2]. In their scheme, a random number is added to the value of a sensitive attribute. For example, if $x_i$ is the value of a sensitive attribute, $x_i + r$, rather than $x_i$, will appear in the database, where $r$ is a random value drawn from some distribution. The paper shows that if the random number is generated from a known distribution (e.g., a uniform or Gaussian distribution), it is possible to recover the distribution of the values of that sensitive attribute. Assuming independence of the attributes, the paper then shows that a decision tree classifier can be built with knowledge of the distribution of each attribute.

Rizvi and Haritsa presented a scheme called MASK to mine associations with secrecy constraints [10]. Evfimievski et al. proposed an approach to privacy-preserving association rule mining based on randomization techniques [6]. Du and Zhan [5] utilized the randomized response technique for decision tree classification.


There are currently two approaches to privacy-preserving data mining. One is the perturbation approach discussed above. The other is to use Secure Multi-party Computation (SMC) techniques [14]. Several SMC-based privacy-preserving data mining schemes have been proposed [15, 4, 9, 12]. These studies mainly focus on two-party distributed computing, in which each party usually contributes a set of records. Although some of the solutions can be extended to solve our problem (an n-party problem), the performance is not desirable when n becomes large.

3  Building  Naive  Bayesian  Classifiers  Using  Multivariate  Randomized  Response  Techniques

Randomized  Response  techniques  were  first  introduced 
by  Warner  [13]  in  1965  as  a  technique  to  solve  the 
following  survey  problem:  to  estimate  the  percentage 
of  people  in  a  population  that  has  attribute  A,  queries 
are  sent  to  a  group  of  people.  Since  the  attribute  A 
is  related  to  some  confidential  aspects  of  human  life, 
respondents  may  decide  not  to  reply  at  all  or  to  reply 
with  incorrect  answers. 

To enhance the level of cooperation, instead of asking each respondent whether he/she has attribute A, the interviewer asks each respondent two related questions whose answers are opposite to each other [13]. For example, the questions could be the two statements below. If the statement is correct, the respondent answers “yes”; otherwise he/she answers “no”.

1.  I  have  the  sensitive  attribute  A. 

2.  I  do  not  have  the  sensitive  attribute  A. 

Respondents use a randomizing device to decide which question to answer, without letting the interviewer know which question is answered. The randomizing device is designed in such a way that the probability of choosing the first question is $\theta$, and the probability of choosing the second question is $1 - \theta$. Although the interviewer learns the responses (i.e., “yes” or “no”), he/she does not know which question was answered by each respondent. Thus the respondents' privacy is preserved. Since the interviewer's interest is in the answer to the first question, and the answer to the second question is exactly the opposite of the answer to the first one, we say that a respondent who chooses to answer the first question is telling the truth, and a respondent who chooses to answer the second question is telling a lie.

To estimate the percentage of people who have the attribute A, we have

$$
\begin{aligned}
P^*(A = \text{yes}) &= P(A = \text{yes}) \cdot \theta + P(A = \text{no}) \cdot (1 - \theta) \\
P^*(A = \text{no}) &= P(A = \text{no}) \cdot \theta + P(A = \text{yes}) \cdot (1 - \theta),
\end{aligned}
$$

where $P^*(A = \text{yes})$ (resp. $P^*(A = \text{no})$) is the proportion of the “yes” (resp. “no”) responses obtained from the survey data, and $P(A = \text{yes})$ (resp. $P(A = \text{no})$) is the estimated proportion of the “yes” (resp. “no”) responses to the sensitive question. Getting $P(A = \text{yes})$ and $P(A = \text{no})$ is the goal of the survey. By solving the above equations, we can get $P(A = \text{yes})$ and $P(A = \text{no})$ if $\theta \neq \frac{1}{2}$.
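As a worked illustration (not part of the original derivation), the system can be solved in closed form under the assumption that every respondent answers either “yes” or “no”, so that $P(A = \text{no}) = 1 - P(A = \text{yes})$. The numerical values in the example are hypothetical.

```latex
\begin{aligned}
P^*(A=\text{yes}) &= P(A=\text{yes})\,\theta + \bigl(1 - P(A=\text{yes})\bigr)(1-\theta) \\
\Rightarrow\quad P(A=\text{yes}) &= \frac{P^*(A=\text{yes}) - (1-\theta)}{2\theta - 1},
  \qquad \theta \neq \tfrac{1}{2}. \\[4pt]
\text{Example: } \theta = 0.8,\ P^*(A=\text{yes}) = 0.38
  \ &\Rightarrow\ P(A=\text{yes}) = \frac{0.38 - 0.2}{0.6} = 0.3 .
\end{aligned}
```

The denominator $2\theta - 1$ vanishes at $\theta = \frac{1}{2}$, which is why that value has to be excluded.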

The randomized response technique discussed above considers only one attribute. However, in data mining, data sets usually consist of multiple attributes; finding the relationships among these attributes is one of the major goals of data mining. Therefore, we need randomized response techniques that can handle multiple attributes while supporting various data mining computations. Prior work has addressed surveys that contain multiple questions [11]. However, those solutions can only handle very low-dimensional situations (e.g., dimension = 2) and cannot be extended to data mining problems, in which the number of dimensions is usually high. We have developed a multivariate randomized response technique (MRR) to deal with multiple attributes.

3.1  Notations 

In this work, we assume data are binary, but the techniques can be extended to categorical data. Suppose there are $N$ attributes $(A_1, \ldots, A_N)$ in a data set. Let $E$ represent any logical expression based on those attributes (e.g., $E = (A_1 = 1) \wedge (A_2 = 0)$); let $\bar{E}$ denote the logical expression that reverses the 1's in $E$ to 0's and the 0's to 1's; we call $\bar{E}$ the opposite of $E$. For the $E$ in the previous example, $\bar{E} = (A_1 = 0) \wedge (A_2 = 1)$.

Let $P^*(E)$ be the proportion of the records in the whole disguised data set that satisfy $E = true$. Let $P(E)$ be the proportion of the records in the whole undisguised data set that satisfy $E = true$ (the undisguised data set contains the true data, but it does not actually exist). $P^*(E)$ can be observed from the disguised data, but $P(E)$, the actual proportion that we are interested in, cannot, because the undisguised data set is not available to anybody; we have to estimate $P(E)$. The goal of MRR is to find a way to estimate $P(E)$ from $P^*(E)$.

In our multivariate scheme, we also divide each expression $E$ into multiple sub-expressions. For example, in a two-group scheme, we write $E = E_1 E_2$, where $E_i$ contains only the attributes in group $i$.

3.2  One-Group  Scheme 

In the one-group scheme, all the attributes are put in the same group, and all the attributes are either reversed together or kept unchanged together. In other words, when sending the private data to the central database, users either tell the truth about all their answers to the sensitive questions or lie about all of them. The probability of the first event is $\theta$, and the probability of the second event is $1 - \theta$. For example, assume a user's truthful values for attributes $A_1$, $A_2$, and $A_3$ are 110. The user generates a random number from 0 to 1; if the number is less than $\theta$, he/she sends 110 to the data collector (i.e., telling the truth); if the number is bigger than $\theta$, he/she sends 001 to the data collector (i.e., lying about all the questions). Because the data collector does not know the random number generated by each user, the data collector cannot know whether a data provider is telling the truth or a lie. To simplify our presentation, we use $P(11)$ to represent $P(A_1 = 1 \wedge A_2 = 1)$ and $P(00)$ to represent $P(A_1 = 0 \wedge A_2 = 0)$, where “$\wedge$” is the logical and operator.
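The user-side disguise step of the one-group scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `disguise_one_group` and the tuple-of-bits record layout are hypothetical and do not come from the paper.

```python
import random


def disguise_one_group(record, theta, rng=random):
    """One-group randomized response for a single binary record.

    With probability `theta` the record is sent unchanged (the truth);
    otherwise every bit is reversed (a lie about all attributes).
    `record` is assumed to be a sequence of 0/1 values.
    """
    if rng.random() < theta:
        return tuple(record)
    return tuple(1 - bit for bit in record)


if __name__ == "__main__":
    # The example from the text: truthful values 110 for A1, A2, A3.
    print(disguise_one_group((1, 1, 0), theta=0.8))
```

Only the disguised tuple leaves the user's machine; the random draw itself is never sent, which is what prevents the collector from telling truth from lie for an individual record.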

Because the contributions to $P^*(11)$ and $P^*(00)$ come partially from $P(11)$ and partially from $P(00)$, we can derive the following equations:

$$
\begin{aligned}
P^*(11) &= P(11) \cdot \theta + P(00) \cdot (1 - \theta) \\
P^*(00) &= P(00) \cdot \theta + P(11) \cdot (1 - \theta)
\end{aligned}
\tag{1}
$$

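By solving the above equations, we can get $P(11)$, the information needed to build a naive Bayesian classifier. The general model for the one-group scheme is described in the following:

$$
\begin{aligned}
P^*(E) &= P(E) \cdot \theta + P(\bar{E}) \cdot (1 - \theta) \\
P^*(\bar{E}) &= P(\bar{E}) \cdot \theta + P(E) \cdot (1 - \theta)
\end{aligned}
\tag{2}
$$

In matrix form, let $M_1$ denote the coefficient matrix of the above equations, and let $p = \theta$ and $q = 1 - \theta$; then

$$
\begin{pmatrix} P^*(E) \\ P^*(\bar{E}) \end{pmatrix}
= M_1 \cdot \begin{pmatrix} P(E) \\ P(\bar{E}) \end{pmatrix},
\qquad
M_1 = \begin{pmatrix} p & q \\ q & p \end{pmatrix}
\tag{3}
$$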
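Solving the one-group model for $P(E)$ and $P(\bar{E})$ amounts to inverting a 2x2 system whose coefficients are $\theta$ and $1 - \theta$. The following Python sketch shows one way to do this; the function name `estimate_one_group` is hypothetical, and the guard for $\theta = 0.5$ is an added safety check rather than something prescribed by the paper.

```python
def estimate_one_group(p_star_e, p_star_opp, theta):
    """Recover (P(E), P(E-bar)) from the observed (P*(E), P*(E-bar)).

    Inverts the system  P*(E)     = P(E)*p     + P(E-bar)*q
                        P*(E-bar) = P(E-bar)*p + P(E)*q
    with p = theta and q = 1 - theta.
    """
    p, q = theta, 1.0 - theta
    det = p * p - q * q  # equals 2*theta - 1, zero when theta == 0.5
    if abs(det) < 1e-12:
        raise ValueError("theta = 0.5 makes the system singular")
    p_e = (p * p_star_e - q * p_star_opp) / det
    p_opp = (p * p_star_opp - q * p_star_e) / det
    return p_e, p_opp
```

For instance, with hypothetical true proportions $P(11) = 0.4$ and $P(00) = 0.1$ and $\theta = 0.8$, the observed values are $P^*(11) = 0.34$ and $P^*(00) = 0.16$, and the function maps them back to 0.4 and 0.1 up to floating-point rounding.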


3.3  Two-Group  Scheme 

In the one-group scheme, if the interviewer somehow knows whether a respondent tells the truth or a lie for one attribute, he/she can immediately obtain the true values of that respondent's responses for all other attributes. To improve the privacy level of the data, data providers divide the attributes into two groups. All data providers must group the attributes in the same way; e.g., if one user puts attributes $A_1$ and $A_2$ in group 1, then all other users also put $A_1$ and $A_2$ in group 1. They then apply the randomized response technique to each group independently. For example, a user can tell the truth for one group while lying for the other group. With this scheme, even if the interviewers know information about one group, they will not be able to derive the information for the other group, because the groups are disguised independently.


To show how to estimate $P(E_1 E_2)$, we look at all the contributions to $P^*(E_1 E_2)$. These include not only the event that users tell the truth for both groups (i.e., records counted in $P(E_1 E_2)$), but also the other three events (i.e., records counted in $P(E_1 \bar{E}_2)$, $P(\bar{E}_1 E_2)$, and $P(\bar{E}_1 \bar{E}_2)$). In terms of $\theta$, the probabilities that a record counted in $P(E_1 E_2)$, $P(E_1 \bar{E}_2)$, $P(\bar{E}_1 E_2)$, or $P(\bar{E}_1 \bar{E}_2)$ is reported as $E_1 E_2$ are, respectively, $\theta^2$, $\theta(1 - \theta)$, $(1 - \theta)\theta$, and $(1 - \theta)^2$. We then have the following equation:

$$
P^*(E_1 E_2) = P(E_1 E_2) \cdot \theta^2 + P(E_1 \bar{E}_2) \cdot \theta(1 - \theta) + P(\bar{E}_1 E_2) \cdot \theta(1 - \theta) + P(\bar{E}_1 \bar{E}_2) \cdot (1 - \theta)^2
$$

There are four unknown variables in the above equation ($P(E_1 E_2)$, $P(E_1 \bar{E}_2)$, $P(\bar{E}_1 E_2)$, and $P(\bar{E}_1 \bar{E}_2)$). To solve for them, we need three more equations, which we can derive using a similar method. The final equations are as follows:

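$$
\begin{pmatrix}
P^*(E_1 E_2) \\ P^*(E_1 \bar{E}_2) \\ P^*(\bar{E}_1 E_2) \\ P^*(\bar{E}_1 \bar{E}_2)
\end{pmatrix}
= M_2 \cdot
\begin{pmatrix}
P(E_1 E_2) \\ P(E_1 \bar{E}_2) \\ P(\bar{E}_1 E_2) \\ P(\bar{E}_1 \bar{E}_2)
\end{pmatrix}
\tag{4}
$$

where $M_2$ is the coefficient matrix. Let $p = \theta$ and $q = 1 - \theta$; then

$$
M_2 =
\begin{pmatrix}
p^2 & pq & pq & q^2 \\
pq & p^2 & q^2 & pq \\
pq & q^2 & p^2 & pq \\
q^2 & pq & pq & p^2
\end{pmatrix}
\tag{5}
$$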

Since the two-group scheme is sufficient for naive Bayesian classification computations, we do not show the estimation model for cases where the number of groups is greater than two.
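The two-group system can be solved without a general linear-algebra routine. Because the two groups are randomized independently, the 4x4 coefficient matrix has the structure of a Kronecker product of the one-group 2x2 matrix with itself, so the inversion can be done one group at a time. The sketch below relies on that observation, which is our own remark rather than a statement from the paper; the function names are hypothetical.

```python
def _undo_pair(a, b, theta):
    """Invert [[p, q], [q, p]] on the pair (a, b), with p = theta, q = 1 - theta."""
    p, q = theta, 1.0 - theta
    det = p * p - q * q
    if abs(det) < 1e-12:
        raise ValueError("theta = 0.5 makes the system singular")
    return (p * a - q * b) / det, (p * b - q * a) / det


def estimate_two_group(p_star, theta):
    """Solve the two-group system for the four undisguised proportions.

    `p_star` lists the observed proportions in the order
    [P*(E1 E2), P*(E1 ~E2), P*(~E1 E2), P*(~E1 ~E2)]; the result uses the
    same order for the estimated P values.
    """
    s00, s01, s10, s11 = p_star
    # Undo the group-2 randomization: pairs that differ only in E2 vs ~E2.
    a00, a01 = _undo_pair(s00, s01, theta)
    a10, a11 = _undo_pair(s10, s11, theta)
    # Undo the group-1 randomization: pairs that differ only in E1 vs ~E1.
    r00, r10 = _undo_pair(a00, a10, theta)
    r01, r11 = _undo_pair(a01, a11, theta)
    return [r00, r01, r10, r11]
```

A general-purpose linear solver applied to the 4x4 matrix would give the same answer; the pairwise form is used here only to keep the example dependency-free.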

3.4  Building  Naive  Bayesian  Classifiers 

Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data. It has been studied extensively by the machine learning, expert systems, and statistics communities as a possible solution to the knowledge discovery problem. Classification is a two-step process. First, a model is built from a training data set composed of data tuples described by attributes. Each tuple is assumed to belong to a predefined class, given by one of the attributes, called the class label attribute. Second, the predictive accuracy of the model (or classifier) is estimated. A test set of class-labeled samples is usually applied to the model. For each test sample, the known class label is compared with the model's prediction.

The naive Bayesian classifier is one of the most successful algorithms in many classification domains. Despite its simplicity, it has been shown to be competitive with more complex approaches, especially in text categorization and content-based filtering. Under a conditional independence assumption, i.e., $P(A_i, A_j \mid C) = P(A_i \mid C) P(A_j \mid C)$ for $1 \le i \neq j \le n$, the naive Bayesian classifier assigns a new data item $x$ to the class with the largest posterior probability, as shown in Eq. 6, where $A_i$ and $A_j$ represent the attributes (variables), $C$ is the class variable, and $n$ is the number of attributes. Further, this posterior classification rule can be transformed into a joint probability classification rule, since $P(A_1, A_2, \ldots, A_n)$ for a given data item is a constant with regard to $C$. Finally, applying the independence assumption, the classification rule is changed into a decomposable form.


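$$
\begin{aligned}
c &= \arg\max_{C_i} P(C_i \mid A_1, A_2, \ldots, A_n) \\
  &= \arg\max_{C_i} \frac{P(C_i) \, P(A_1, A_2, \ldots, A_n \mid C_i)}{P(A_1, A_2, \ldots, A_n)} \\
  &= \arg\max_{C_i} P(C_i) \, P(A_1, A_2, \ldots, A_n \mid C_i) \\
  &= \arg\max_{C_i} P(C_i) \prod_{j=1}^{n} P(A_j \mid C_i) \\
  &= \arg\max_{C_i} P(C_i) \prod_{j=1}^{n} \frac{P(A_j, C_i)}{P(C_i)}
\end{aligned}
\tag{6}
$$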

To build the NB classifier, we need to compute $P(C_i)$ and $P(A_j \mid C_i) = P(A_j, C_i) / P(C_i)$. Without loss of generality, we assume the database contains only binary values, and we show how to compute these terms from disguised training data sets.

Let $E$ be a logical expression based on attributes. Let $P(E)$ be the proportion of the records in the undisguised data set (the true but non-existing data set) that satisfy $E = true$. Because of the disguise, $P(E)$ cannot be observed directly from the disguised data, and it has to be estimated. Let $P^*(E)$ be the proportion of the records in the disguised data set that satisfy $E = true$. $P^*(E)$ can be computed directly from the disguised data.

To compute $P(C_i)$, we can utilize the one-group model (Eq. 2) with $E = C_i$ and $\bar{E} = \bar{C}_i$; $P^*(E)$ and $P^*(\bar{E})$ can be computed directly from the (whole) disguised data set. Therefore, by solving the equations of Eq. 2 (when $\theta \neq \frac{1}{2}$), we can get $P(E)$, which is $P(C_i)$ in this case.

To compute $P(A_j, C_i)$, we need to know whether $A_j$ and $C_i$ belong to the same group. If they come from the same group, we can still use the one-group estimation model (Eq. 2) with $E = (A_j \wedge C_i)$ and $\bar{E} = (\bar{A}_j \wedge \bar{C}_i)$. However, if $A_j$ and $C_i$ belong to different groups, we need to utilize the estimation model for the two-group scheme (Eq. 4) with $E_1 = A_j$, $\bar{E}_1 = \bar{A}_j$, $E_2 = C_i$ and $\bar{E}_2 = \bar{C}_i$. Once we obtain $P(A_j, C_i)$ and $P(C_i)$, an NB classifier can be constructed.
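To make the construction concrete, here is a minimal Python sketch for the simplest case, in which all attributes and the class label sit in one group and are therefore flipped together. It is an illustration, not the authors' implementation: the record layout `(a_1, ..., a_n, c)`, the function names, and the clamping of estimates to a small positive value (to avoid division by zero when an estimate is zero or negative due to sampling noise) are all assumptions. It also assumes $\theta \neq 0.5$.

```python
from collections import Counter

_EPS = 1e-9  # assumed floor for estimated probabilities (not from the paper)


def _solve(p_star_e, p_star_opp, theta):
    """Estimate P(E) from P*(E) and P*(E-bar) with the one-group model."""
    p, q = theta, 1.0 - theta
    estimate = (p * p_star_e - q * p_star_opp) / (p * p - q * q)
    return min(max(estimate, _EPS), 1.0)


def train_nb_one_group(disguised, theta):
    """Estimate P(C = c) and P(A_j = a, C = c) from disguised binary records.

    Each record is a tuple (a_1, ..., a_n, c) of 0/1 values.
    """
    total = len(disguised)
    n_attrs = len(disguised[0]) - 1
    class_counts = Counter(rec[-1] for rec in disguised)
    joint_counts = Counter(
        (j, rec[j], rec[-1]) for rec in disguised for j in range(n_attrs)
    )
    prior, joint = {}, {}
    for c in (0, 1):
        prior[c] = _solve(class_counts[c] / total,
                          class_counts[1 - c] / total, theta)
        for j in range(n_attrs):
            for a in (0, 1):
                # The opposite of (A_j = a and C = c) is (A_j = 1-a and C = 1-c).
                joint[(j, a, c)] = _solve(
                    joint_counts[(j, a, c)] / total,
                    joint_counts[(j, 1 - a, 1 - c)] / total,
                    theta,
                )
    return prior, joint


def classify(attrs, prior, joint):
    """Return the class maximizing P(c) * prod_j P(a_j, c) / P(c)."""
    def score(c):
        s = prior[c]
        for j, a in enumerate(attrs):
            s *= joint[(j, a, c)] / prior[c]
        return s
    return max((0, 1), key=score)
```

When an attribute and the class label are placed in different groups, the call to `_solve` for the joint term would be replaced by a solve of the two-group system.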


3.5  Testing 

Conducting the testing is straightforward when data are not disguised, but it is a non-trivial task when the testing data set is disguised. Suppose we choose a record from the testing data set, compute a predicted class label using the naive Bayesian classifier, and find that the predicted label does not match the record's label. Can we say this record fails the testing? If the record is a true one, we can draw that conclusion, but if the record is a false one (due to the randomization), we cannot. How, then, can we compute the accuracy score of the NB classifier?


We again use the randomized response techniques to compute the accuracy score. For simplicity, we only describe how to conduct testing using the two-group scheme (since the one-group scheme is a special case of the two-group scheme). We use an example to illustrate how we compute the accuracy score. Assume the number of attributes is 2, and the probability $\theta = 0.8$. To test a record $(A_1 = 1, A_2 = 0)$ (denoted by 10), with $A_1$ belonging to group 1 and $A_2$ belonging to group 2, we feed 10, 11, 00, and 01 to the classifier. We know that one of the class-label predictions corresponds to the true record, but we do not know exactly which one. However, with enough testing data, we can estimate the total accuracy score, even though we do not know which test case produces the correct prediction result.

Using the (disguised) testing data set $S = S_1 S_2$, we construct the other data sets $S_1 \bar{S}_2$, $\bar{S}_1 S_2$, and $\bar{S}_1 \bar{S}_2$ by reversing the corresponding values in $S_1$ and $S_2$ (changing 0 to 1 and 1 to 0). Note that each record in $\bar{S}_i$ (for $i \in \{1, 2\}$) is the opposite of the corresponding record in $S_i$. We say that $\bar{S}_i$ is the opposite of the data set $S_i$. Similarly, we define $U_i$ as the original undisguised testing data set, and $\bar{U}_i$ as the opposite of $U_i$.

Let $P^*(cc)$ be the proportion of correct predictions from testing data set $S_1 S_2$, $P^*(c\bar{c})$ be the proportion of correct predictions from testing data set $S_1 \bar{S}_2$, ..., and $P^*(\bar{c}\bar{c})$ be the proportion of correct predictions from testing data set $\bar{S}_1 \bar{S}_2$. Similarly, let $P(cc)$ be the proportion of correct predictions from the original undisguised data set $U_1 U_2$, $P(c\bar{c})$ be the proportion of correct predictions from $U_1 \bar{U}_2$, ..., and $P(\bar{c}\bar{c})$ be the proportion of correct predictions from $\bar{U}_1 \bar{U}_2$. $P(cc)$ is what we want to estimate.

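Because $P^*(cc)$, $P^*(c\bar{c})$, ... and $P^*(\bar{c}\bar{c})$ consist of contributions from $P(cc)$, $P(c\bar{c})$, ... and $P(\bar{c}\bar{c})$, we have the following formula:

$$
\begin{pmatrix}
P^*(cc) \\ P^*(c\bar{c}) \\ P^*(\bar{c}c) \\ P^*(\bar{c}\bar{c})
\end{pmatrix}
= M_2 \cdot
\begin{pmatrix}
P(cc) \\ P(c\bar{c}) \\ P(\bar{c}c) \\ P(\bar{c}\bar{c})
\end{pmatrix}
$$

where $M_2$ is defined in Eq. (4). $P^*(cc)$, $P^*(c\bar{c})$, $P^*(\bar{c}c)$ and $P^*(\bar{c}\bar{c})$ can be obtained from testing data sets $S_1 S_2$, $S_1 \bar{S}_2$, $\bar{S}_1 S_2$ and $\bar{S}_1 \bar{S}_2$. By solving the above formula, we can get $P(cc)$, the accuracy score of testing.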
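The first half of this procedure, building the four variants of the disguised test set and measuring the observed proportion of correct predictions on each, can be sketched as follows. The representation is an assumption made for illustration: each test record is a pair `(g1, g2)` of bit tuples, one per group, and `is_correct` is a hypothetical callback that runs the classifier on a (possibly flipped) record and reports whether its prediction matches the label carried by that record.

```python
def flip(bits):
    """Return the opposite of a tuple of 0/1 values."""
    return tuple(1 - b for b in bits)


def observed_accuracies(test_set, is_correct):
    """Proportion of correct predictions on each of the four test-set variants.

    Returns the observed proportions for, in order: the test set as given,
    group 2 flipped, group 1 flipped, and both groups flipped.
    """
    variants = [
        lambda g1, g2: (g1, g2),
        lambda g1, g2: (g1, flip(g2)),
        lambda g1, g2: (flip(g1), g2),
        lambda g1, g2: (flip(g1), flip(g2)),
    ]
    n = len(test_set)
    return [
        sum(bool(is_correct(*variant(g1, g2))) for g1, g2 in test_set) / n
        for variant in variants
    ]
```

The four returned values form the left-hand side of the formula above; solving the system for its first unknown then gives the estimated accuracy on the undisguised test data.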

4  Experimental  Results 

To evaluate the effectiveness of our multivariate randomized response techniques on the naive Bayesian classifier, we compare the classification accuracy of our multivariate scheme with the original accuracy, which is defined as the accuracy of the classifier induced from the original data.

4.1  Data  Setup 

We conduct experiments on two real-life data sets, obtained from the UCI Machine Learning Repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases). The first data set is called Adult. It contains 48842 instances with 14 attributes (6 continuous and 8 nominal) and a label describing the salary level. The prediction task is to determine whether a person's income exceeds 50k/year based on census data. We used the first 10,000 instances in our experiment. The second data set is called Breast-Cancer. It has 699 instances with 10 attributes. The prediction task is to decide whether a case is benign or malignant.

We modified the naive Bayesian classification algorithm to handle the randomized data based on our proposed methods. We run this modified algorithm on the randomized data and obtain a classifier. We also apply the original naive Bayesian classification algorithm to the original data set and obtain a second classifier. We then apply the same testing data to both classifiers. Our goal is to compare the classification accuracy of these two classifiers. Naturally, we want the accuracy of the classifier built with our method to be close to the accuracy of the classifier built with the original algorithm.

4.2  Experimental  Steps 

Our  experiments  consist  of  the  following  steps: 

Preprocessing:  Since we assume that the data set contains only binary data, we first transformed the original non-binary data to binary. We split the values of each attribute at the median point of the attribute's range. After preprocessing, we divided the data sets into a training data set D and a testing data set B. Note that B will be used for comparing our results with the benchmark results.

Benchmark:  We use D and the original NB classification algorithm to build a classifier; we then use the data set B to test the classifier and get an accuracy score. We call this score the original accuracy (or the benchmark score).

$\theta$ Selection:  For $\theta$ = 0.0, 0.1, 0.2, 0.3, 0.4, 0.45, 0.51, 0.55, 0.6, 0.7, 0.8, 0.9, and 1.0, we conduct the following 4 steps:

1.  Randomization:  We create a disguised data set G. For each record in the training data set D, we generate a random number r from 0 to 1 using a uniform distribution. If $r < \theta$, we copy the record to G without any change; if $r > \theta$, we copy the opposite of the record to G, i.e., each attribute value of the record that we put into G is exactly the opposite of the value in the original record. We perform this randomization step for all the records in the training data set D and generate the new data set G.


Figure 1.  The Results On The Adult Data Set: (a) Plot of Means, (b) Plot of Variance.

Figure 2.  The Results On The Breast-Cancer Data Set: (a) Plot of Means, (b) Plot of Variance.


2.  Classifier Construction:  We use the data set G and our modified NB classification algorithm to build a naive Bayesian classifier Tq.

3.  Testing:  We use the data set B to test Tq, and we get an accuracy score S.

4.  Repetition:  We repeat steps 1-3 100 times, and get $S_1, \ldots, S_{100}$. We then compute the mean and the variance of these 100 accuracy scores.
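The four steps above can be summarized as a small driver loop. The sketch below is a hypothetical harness rather than the code used in the experiments: `fit` and `score` are placeholder callbacks standing in for the modified NB training algorithm and the testing procedure, the fixed seed is added for reproducibility, and the use of the population variance is an assumption, since the text does not say which variance estimator was used.

```python
import random
from statistics import mean, pvariance


def run_trials(train, test, theta, fit, score, trials=100, seed=0):
    """Repeat randomization, training, and testing; return (mean, variance).

    `train` is a list of binary tuples. `fit(disguised, theta)` must return a
    model and `score(model, test)` must return an accuracy value.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        # Step 1: keep each record with probability theta, else flip all bits.
        disguised = [
            rec if rng.random() < theta else tuple(1 - b for b in rec)
            for rec in train
        ]
        model = fit(disguised, theta)       # Step 2: build the classifier
        scores.append(score(model, test))   # Step 3: test it on B
    return mean(scores), pvariance(scores)  # Step 4: summarize the runs
```

Calling this function once per $\theta$ value in the list above would produce the series of means and variances that the two figures plot.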

4.3  The  Result  Analysis 

4.3.1  The  Analysis  of  Mean 

Fig. 1(a) and 2(a) show the mean values of the accuracy scores for the Adult and Breast-Cancer data sets, respectively. We can see from the figures that when $\theta = 1$ and $\theta = 0$, the results are exactly the same as the results when the original classification algorithm is applied. This is because when $\theta = 1$, the randomized data sets are exactly the same as the original data set D; when $\theta = 0$, the randomized data sets are exactly the opposite of the original data set D. In both cases, our algorithm produces accurate results (compared to the original algorithm), but privacy is not preserved in either case, because an adversary can know the real values of all the records provided that he/she knows the $\theta$ value. As $\theta$ moves from 1 or 0 towards 0.5, the mean accuracy tends to decrease. When $\theta$ is around 0.5, the mean deviates considerably from the original accuracy score.

4.3.2  The  Analysis  of  Variance 

Fig. 1(b) and 2(b) show the variances of the accuracy
scores for the Adult and Breast-Cancer data sets,
respectively. When θ moves from 1 or 0 towards 0.5, the
degree of randomness in the disguised data increases,
and the variance of the estimation used in our method
becomes large. The variance changes with the
randomization level θ. When θ is near 0.5, the
randomization level is much higher and true information
about the original data set is better disguised;
in other words, more information is lost. Therefore the
variance is much larger than when θ is not
around 0.5. This is what we predicted.
We use a simple example to illustrate why this happens.
Assume we have just one attribute, with 90% of
1's and 10% of 0's. If we choose θ = 0.5, according to
our randomization scheme, the disguised data set will
contain 90% * 0.5 + 10% * 0.5 = 50% of 1's and another
50% of 0's. If we change the distribution to 10% of 1's
and 90% of 0's, we get the same result. This means
that when θ = 0.5, information about the data distribution
is lost. That is why, when θ is close to 0.5, the accuracy
becomes very low and the variance becomes very large.
(Note that 0.5 is a very low accuracy: with just two class
labels, randomly guessing the class label is correct
1 out of 2 times, so even a random guess achieves an
accuracy of 0.5.)
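
The one-attribute example can be checked numerically. The sketch below is illustrative only. `disguised_fraction` follows the arithmetic in the example above. `estimate_true_fraction` and `estimator_variance` are assumptions that are not given in this section: they use the standard related-question (Warner) inversion and its variance for a single binary attribute, which may differ from the multivariate estimator used in our method. All function names are hypothetical.

```python
import math


def disguised_fraction(p1, theta):
    """Expected fraction of 1's in G when the original fraction of 1's is p1."""
    return p1 * theta + (1.0 - p1) * (1.0 - theta)


def estimate_true_fraction(observed, theta):
    """Invert the relation above (assumed Warner-style estimator).

    No inverse exists at theta = 0.5 because every distribution maps to 50%.
    """
    denom = 2.0 * theta - 1.0
    if math.isclose(denom, 0.0, abs_tol=1e-12):
        raise ValueError("theta = 0.5: the original distribution cannot be recovered")
    return (observed - (1.0 - theta)) / denom


def estimator_variance(observed, theta, n):
    """Variance of the assumed estimator from n disguised records; grows as theta -> 0.5."""
    denom = 2.0 * theta - 1.0
    return observed * (1.0 - observed) / (n * denom ** 2)


for p1 in (0.9, 0.1):
    # Both distributions collapse to a fraction of about 0.5 when theta = 0.5.
    print(p1, disguised_fraction(p1, 0.5))
```

Under these assumptions the divisor (2θ - 1) shrinks towards zero as θ approaches 0.5, which is one way to see why the variance grows sharply in that region.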

4.3.3  Summary 

Our results on the two real-life data sets indicate that
the multivariate randomized response techniques can
be utilized for privacy-preserving naive Bayesian
classification. When θ is 0 or 1, which provides all the
true information, the accuracy of the classifier is the
highest and the privacy level of the data is the lowest.
When θ moves away from 0 (or 1) and approaches 0.5,
the accuracy of the classifier decreases and the privacy
level of the data increases. The accuracy depends
on the recoverability of the original data from the
randomized data. The empirical results confirm that
recoverability and privacy are complementary goals,
and the research presented here allows a quantitative
evaluation of the trade-offs between the two, e.g., if
the recoverability of the original data is 20%, privacy
will be at most 80%.

When θ = 0.5, the related model cannot be applied,
and other techniques such as randomized response
techniques using the Unrelated-Question model
may be employed. Note that in our experiment, we
did not randomize the class label, and therefore only
the one-group scheme is implemented.

5  Concluding  Remarks 

In this paper, we have presented a method to build
naive Bayesian classifiers using the multivariate
randomized response technique. The experimental results
show that when we select the randomization parameter
θ from [0.6, 1] or [0, 0.4], we can get fairly accurate
classifiers compared to the classifiers built from the
undisguised data. In our future work, we will apply
our techniques to solve other data mining problems
(e.g., association rule mining) and extend our solution
to deal with the cases where the data type is not binary.